Hands-on Exercise 3.1: Processing and Visualising Flow Data

Author

NeoYX

Published

November 30, 2023

Modified

December 2, 2023

15.1 Overview

Spatial interaction represent the flow of people, material, or information between locations in geographical space. It encompasses everything from freight shipments, energy flows, and the global trade in rare antiquities, to flight schedules, rush hour woes, and pedestrian foot traffic.

Each spatial interaction, as an analogy for a set of movements, is composed of a discrete origin/destination pair. Each pair can be represented as a cell in a matrix where rows are related to the locations (centroids) of origin, while columns are related to locations (centroids) of destination. Such a matrix is commonly known as an origin/destination matrix, or a spatial interaction matrix.

In this hands-on exercise, build an OD matrix by using Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall.

By the end to this hands-on exercise, we will be able to:

  • to import and extract OD data for a selected time interval,

  • to import and save geospatial data (i.e. bus stops and mpsz) into sf tibble data frame objects,

  • to populate planning subzone code into bus stops sf tibble data frame,

  • to construct desire lines geospatial data from the OD data, and

  • to visualise passenger volume by origin and destination bus stops by using the desire lines data.

15.2 Getting Started

For the purpose of this exercise, four R packages will be used. They are:

  • sf for importing, integrating, processing and transforming geospatial data.

  • tidyverse for importing, integrating, wrangling and visualising data.

  • tmap for creating thematic maps

  • stplanr for solving common problems in transport planning and modelling, such as how to best get from point A to point B

  • ggpubr for some easy-to-use functions for creating and customizing ‘ggplot2’- based publication ready plots.

  • performance for for computing measures to assess model quality, which are not directly provided by R’s ‘base’ or ‘stats’ packages. The primary goal of the performance package is to provide utilities for computing indices of model quality and goodness of fit. These include measures like r-squared (R2), root mean squared error (RMSE)

pacman::p_load(tmap, sf, DT, stplanr,
               performance,
               ggpubr, tidyverse)

15.3 Preparing the Flow Data

15.3.1 Importing the OD data

Import the Passenger Volume by Origin Destination Bus Stops data set downloaded from LTA DataMall by using read_csv() of readr package.

odbus <- read_csv("data/aspatial/origin_destination_bus_202308.csv")

Display the odbus tibble data table by using the code chunk below.

glimpse(odbus)
Rows: 5,709,512
Columns: 7
$ YEAR_MONTH          <chr> "2023-08", "2023-08", "2023-08", "2023-08", "2023-…
$ DAY_TYPE            <chr> "WEEKDAY", "WEEKENDS/HOLIDAY", "WEEKENDS/HOLIDAY",…
$ TIME_PER_HOUR       <dbl> 16, 16, 14, 14, 17, 17, 17, 17, 7, 17, 14, 10, 10,…
$ PT_TYPE             <chr> "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "BUS", "…
$ ORIGIN_PT_CODE      <chr> "04168", "04168", "80119", "80119", "44069", "4406…
$ DESTINATION_PT_CODE <chr> "10051", "10051", "90079", "90079", "17229", "1722…
$ TOTAL_TRIPS         <dbl> 7, 2, 3, 10, 5, 4, 3, 22, 3, 3, 7, 1, 3, 1, 3, 1, …

A quick check of odbus tibble data frame shows that the values in ORIGIN_PT_CODE and DESTINATON_PT_CODE are in numeric data type. Hence, the code chunk below is used to convert these data values into character data type.

odbus$ORIGIN_PT_CODE <- as.factor(odbus$ORIGIN_PT_CODE)
odbus$DESTINATION_PT_CODE <- as.factor(odbus$DESTINATION_PT_CODE) 

15.3.2 Extracting the study data

The data in odbus is generalised into weekend and weekday data. For the purpose of this exercise, we will extract commuting flows on weekday and between 6 and 9 o’clock. After the group-by and sum, the total rows reduced from 5,709,512 ro 241,503.

odbus6_9 <- odbus %>%
  filter(DAY_TYPE == "WEEKDAY") %>%
  filter(TIME_PER_HOUR >= 6 &
           TIME_PER_HOUR <= 9) %>%
  group_by(ORIGIN_PT_CODE,
           DESTINATION_PT_CODE) %>%
  summarise(TRIPS = sum(TOTAL_TRIPS))

Print the content of odbus6_9

datatable(odbus6_9,
          class = 'cell-border stripe',
          options = list(pageLength = 5))

If we would like to, we could save the output in rds format for future use. We need to ensure that there exists a folder called ‘rds’ in ‘data’ folder before running the code chunk.

write_rds(odbus6_9, "data/rds/odbus6_9.rds")

To read rds files:

odbus6_9 <- read_rds("data/rds/odbus6_9.rds")

15.4 Working with Geospatial Data

In this exercise, two geospatial data will be used. They are:

  • BusStop: This data provides the location of bus stop as at the third quarter of 2023. This data is refreshed quarterly by LTA. The last update was in July 2023.

  • MPSZ-2019: This data provides the sub-zone boundary of URA Master Plan 2019.

Both data sets are in ESRI shapefile format.

15.4.1 Importing geospatial data

Load the BusStop geospatial data using the st_read() function of sf package. Using the st_crs(busstop) will show that the coordinate system used is WSG84 (decimal deg). Using st_tranform(), we will convert the geographical coordinates system to SIngapore’s projected coordinate system crs=3414.

Note that there are repeated bus stop ids , however they have different bus stop roof ids and geometry values.

busstop <- st_read(dsn = "data/geospatial/BusStopLocation/BusStopLocation_Jul2023",
                   layer = "BusStop") %>%
  st_transform(crs = 3414)
Reading layer `BusStop' from data source 
  `C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\BusStopLocation\BusStopLocation_Jul2023' 
  using driver `ESRI Shapefile'
Simple feature collection with 5161 features and 3 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 3970.122 ymin: 26482.1 xmax: 48284.56 ymax: 52983.82
Projected CRS: SVY21

Next load the mpsz data. There are 332 planning subzones in Singapore.

mpsz <- st_read(dsn = "data/geospatial/MPSZ-2019",
                   layer = "MPSZ-2019") %>%
  st_transform(crs = 3414)
Reading layer `MPSZ-2019' from data source 
  `C:\yixin-neo\ISSS624_AGA\Hands-on_Ex3\data\geospatial\MPSZ-2019' 
  using driver `ESRI Shapefile'
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 103.6057 ymin: 1.158699 xmax: 104.0885 ymax: 1.470775
Geodetic CRS:  WGS 84
mpsz
Simple feature collection with 332 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 2667.538 ymin: 15748.72 xmax: 56396.44 ymax: 50256.33
Projected CRS: SVY21 / Singapore TM
First 10 features:
                 SUBZONE_N SUBZONE_C       PLN_AREA_N PLN_AREA_C       REGION_N
1              MARINA EAST    MESZ01      MARINA EAST         ME CENTRAL REGION
2         INSTITUTION HILL    RVSZ05     RIVER VALLEY         RV CENTRAL REGION
3           ROBERTSON QUAY    SRSZ01  SINGAPORE RIVER         SR CENTRAL REGION
4  JURONG ISLAND AND BUKOM    WISZ01  WESTERN ISLANDS         WI    WEST REGION
5             FORT CANNING    MUSZ02           MUSEUM         MU CENTRAL REGION
6         MARINA EAST (MP)    MPSZ05    MARINE PARADE         MP CENTRAL REGION
7                   SUDONG    WISZ03  WESTERN ISLANDS         WI    WEST REGION
8                  SEMAKAU    WISZ02  WESTERN ISLANDS         WI    WEST REGION
9           SOUTHERN GROUP    SISZ02 SOUTHERN ISLANDS         SI CENTRAL REGION
10                 SENTOSA    SISZ01 SOUTHERN ISLANDS         SI CENTRAL REGION
   REGION_C                       geometry
1        CR MULTIPOLYGON (((33222.98 29...
2        CR MULTIPOLYGON (((28481.45 30...
3        CR MULTIPOLYGON (((28087.34 30...
4        WR MULTIPOLYGON (((14557.7 304...
5        CR MULTIPOLYGON (((29542.53 31...
6        CR MULTIPOLYGON (((35279.55 30...
7        WR MULTIPOLYGON (((15772.59 21...
8        WR MULTIPOLYGON (((19843.41 21...
9        CR MULTIPOLYGON (((30870.53 22...
10       CR MULTIPOLYGON (((26879.04 26...
Note
  • st_read() function of sf package is used to import the shapefile into R as sf data frame.

  • st_transform() function of sf package is used to transform the projection to crs 3414.

15.5 Geospatial data wrangling

15.5.1 Combining Busstop and mpsz

Code chunk below populates the planning subzone code (i.e. SUBZONE_C) of mpsz sf data frame into busstop sf data frame. The output of st_intersection() is a point sf object. We do not need and therefore will drop the geometry. The number of observations reduced from 5,161 to 5,156 after operation, suggesting that 5 bus stops have been dropped as their point geometry is not within the polygon boundary of sf df mpsz.

busstop_mpsz <- st_intersection(busstop, mpsz) %>%
  select(BUS_STOP_N, SUBZONE_C, LOC_DESC) %>%
  st_drop_geometry()
Note
  • st_intersection() is used to perform point and polygon overly and the output will be in point sf object.

  • select() of dplyr package is then use to retain only BUS_STOP_N and SUBZONE_C in the busstop_mpsz sf data frame.

  • five bus stops are excluded in the resultant data frame because they are outside of Singapore boundary.

datatable(busstop_mpsz,
          class = 'cell-border stripe',
          options = list(pageLength = 5))

Save the output into rds format

write_rds(busstop_mpsz, "data/rds/busstop_mpsz.rds")  

Next, we are going to append the planning subzone code from busstop_mpsz data frame onto odbus6_9 data frame. By doing so, we get the fields ‘ORIGIN_BS’, ‘DESTIN_BS” and ’ORIGIN_SZ’ all in a df .

busstop_mpsz %>%
  group_by(BUS_STOP_N, SUBZONE_C) %>%
  filter(n()>1) %>%
  ungroup()
# A tibble: 26 × 3
   BUS_STOP_N SUBZONE_C LOC_DESC       
   <chr>      <chr>     <chr>          
 1 11009      QTSZ01    Ghim Moh Ter   
 2 11009      QTSZ01    GHIM MOH TER   
 3 82221      GLSZ05    BLK 3A         
 4 82221      GLSZ05    Blk 3A         
 5 22501      JWSZ09    Blk 662A       
 6 22501      JWSZ09    BLK 662A       
 7 96319      TMSZ05    Yusen Logistics
 8 96319      TMSZ05    YUSEN LOGISTICS
 9 43709      BKSZ07    BLK 644        
10 43709      BKSZ07    BLK 644        
# ℹ 16 more rows

The join columns will be ‘ORIGIN_PT_CODE’ from odbus6_9 df and ‘BUS_STOP_N’ from busstop_mpsz df. The columns will also be renamed.

Before left_join, odbus6_9 has 241,503 rows, after left join od_data has 242,235 rows.

od_data <- left_join(odbus6_9 , busstop_mpsz,
            by = c("ORIGIN_PT_CODE" = "BUS_STOP_N")) %>%
  rename(ORIGIN_BS = ORIGIN_PT_CODE,
         ORIGIN_SZ = SUBZONE_C,
         DESTIN_BS = DESTINATION_PT_CODE)

Check for duplicate for proceeding

duplicate <- od_data %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

duplicate
# A tibble: 794 × 5
   ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC
   <chr>     <fct>     <dbl> <chr>     <chr>   
 1 43709     43009        15 BKSZ07    BLK 644 
 2 43709     43009        15 BKSZ07    BLK 644 
 3 43709     43419        42 BKSZ07    BLK 644 
 4 43709     43419        42 BKSZ07    BLK 644 
 5 43709     43469         1 BKSZ07    BLK 644 
 6 43709     43469         1 BKSZ07    BLK 644 
 7 43709     43479        62 BKSZ07    BLK 644 
 8 43709     43479        62 BKSZ07    BLK 644 
 9 43709     43489        23 BKSZ07    BLK 644 
10 43709     43489        23 BKSZ07    BLK 644 
# ℹ 784 more rows

Remove the duplicated records. The od_data df reduced from 242,235 rows to 241,838 rows after moving duplicates.

od_data <- unique(od_data)

Double check again

od_data %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()
# A tibble: 0 × 5
# ℹ 5 variables: ORIGIN_BS <chr>, DESTIN_BS <fct>, TRIPS <dbl>,
#   ORIGIN_SZ <chr>, LOC_DESC <chr>

Print the current od_data df to see what we are still lacking. We are will missing the destination subzone codes.

knitr::kable(head(od_data,3))
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC
01012 01112 276 RCSZ10 HOTEL GRAND PACIFIC
01012 01113 143 RCSZ10 HOTEL GRAND PACIFIC
01012 01121 66 RCSZ10 HOTEL GRAND PACIFIC

Again, get the destination subzone code for each destination bus stops by performing a left_join again with busstop_mpsz (contains subzone_c codes for each bus stop id).

After left_join, the number of rows increased from 241,838 rows to 242,831 rows.

od_data <- left_join(od_data , busstop_mpsz,
            by = c("DESTIN_BS" = "BUS_STOP_N"))

Check for duplicates

duplicate <- od_data %>%
  group_by_all() %>%
  filter(n()>1) %>%
  ungroup()

duplicate
# A tibble: 880 × 7
   ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC.x           SUBZONE_C LOC_DESC.y
   <chr>     <chr>     <dbl> <chr>     <chr>                <chr>     <chr>     
 1 01013     51071         1 RCSZ10    ST JOSEPH'S CH       CCSZ01    MACRITCHI…
 2 01013     51071         1 RCSZ10    ST JOSEPH'S CH       CCSZ01    MACRITCHI…
 3 01112     51071        65 RCSZ10    OPP BUGIS STN EXIT C CCSZ01    MACRITCHI…
 4 01112     51071        65 RCSZ10    OPP BUGIS STN EXIT C CCSZ01    MACRITCHI…
 5 01112     53041         5 RCSZ10    OPP BUGIS STN EXIT C BSSZ01    Upp Thoms…
 6 01112     53041         5 RCSZ10    OPP BUGIS STN EXIT C BSSZ01    Upp Thoms…
 7 01121     51071         3 RCSZ04    STAMFORD PR SCH      CCSZ01    MACRITCHI…
 8 01121     51071         3 RCSZ04    STAMFORD PR SCH      CCSZ01    MACRITCHI…
 9 01239     51071        22 KLSZ09    SULTAN PLAZA         CCSZ01    MACRITCHI…
10 01239     51071        22 KLSZ09    SULTAN PLAZA         CCSZ01    MACRITCHI…
# ℹ 870 more rows

Remove duplicates

od_data <- unique(od_data)

Sneak peak of the current od_data

knitr::kable(head(od_data,3))
ORIGIN_BS DESTIN_BS TRIPS ORIGIN_SZ LOC_DESC.x SUBZONE_C LOC_DESC.y
01012 01112 276 RCSZ10 HOTEL GRAND PACIFIC RCSZ10 OPP BUGIS STN EXIT C
01012 01113 143 RCSZ10 HOTEL GRAND PACIFIC DTSZ01 BUGIS STN EXIT B
01012 01121 66 RCSZ10 HOTEL GRAND PACIFIC RCSZ04 STAMFORD PR SCH

The code chunk below will do the following:

  1. Renames the destination ‘SUBZONE_C’ to ‘DESTIN_SZ’.

  2. There are missing subzone codes for some of the origin and destination bus stop because the bus stops location in July 2023 could be more outdated than August bus stop 2023. We will drop columns with missing values.

  3. Group-by origin subzone and destination subzone to generate a new field ‘MORNING_PEAK’ that contains the summation of all trips from subzone A to subzone B.

od_data <- od_data %>%
  rename(DESTIN_SZ = SUBZONE_C) %>%
  drop_na() %>%
  group_by(ORIGIN_SZ, DESTIN_SZ) %>%
  summarise(MORNING_PEAK = sum(TRIPS))

Take a look at our final od_data df

knitr::kable(head(od_data %>% 
                    arrange(desc(MORNING_PEAK)),
                  10))
ORIGIN_SZ DESTIN_SZ MORNING_PEAK
TMSZ02 TMSZ02 350755
WDSZ03 WDSZ03 239791
JWSZ08 JWSZ09 234343
BDSZ04 BDSZ04 217917
JWSZ09 JWSZ09 153055
YSSZ03 YSSZ01 152800
JWSZ04 JWSZ04 149110
TMSZ03 TMSZ02 134196
PRSZ05 PRSZ03 103148
JWSZ03 JWSZ04 99666

Save the output into an rds file format.

write_rds(od_data, "data/rds/od_data.rds")
od_data <- read_rds("data/rds/od_data.rds")

15.6 Visualising Spatial Interaction

In this section, wewill learn how to prepare a desire line by using stplanr package.

15.6.1 Removing intra-zonal flows

We will not plot the intra-zonal flows. The code chunk below will be used to remove intra-zonal flows. It does so by removing the flows that originate and ends in the same subzone.

Rows reduced from 20,987 to 20,697.

od_data1 <- od_data[od_data$ORIGIN_SZ!=od_data$DESTIN_SZ,]

15.6.2 Creating desire lines

In this code chunk below, od2line() of stplanr package is used to create the desire lines.

od_data1 is aspatial while mpsz is geospatial data.

Arguments

flow

A data frame representing origin-destination data. The first two columns of this data frame should correspond to the first column of the data in the zones. Thus in cents_sf(), the first column is geo_code. This corresponds to the first two columns of flow().

zones

A spatial object representing origins (and destinations if no separate destinations object is provided) of travel.

destinations

A spatial object representing destinations of travel flows.

zone_code

Name of the variable in zones containing the ids of the zone. By default this is the first column names in the zones.

The output flowLine is a sf LINESTRING object.

flowLine <- od2line(flow=od_data1,
                    zones= mpsz,
                    zone_code= 'SUBZONE_C')

flowLine
Simple feature collection with 20697 features and 3 fields
Geometry type: LINESTRING
Dimension:     XY
Bounding box:  xmin: 5105.594 ymin: 25813.33 xmax: 49483.22 ymax: 49552.79
Projected CRS: SVY21 / Singapore TM
First 10 features:
   ORIGIN_SZ DESTIN_SZ MORNING_PEAK                       geometry
1     AMSZ01    AMSZ02        11742 LINESTRING (29501.77 39419....
2     AMSZ01    AMSZ03        14886 LINESTRING (29501.77 39419....
3     AMSZ01    AMSZ04         3237 LINESTRING (29501.77 39419....
4     AMSZ01    AMSZ05         9349 LINESTRING (29501.77 39419....
5     AMSZ01    AMSZ06         2231 LINESTRING (29501.77 39419....
6     AMSZ01    AMSZ07         1714 LINESTRING (29501.77 39419....
7     AMSZ01    AMSZ08         2624 LINESTRING (29501.77 39419....
8     AMSZ01    AMSZ09         2371 LINESTRING (29501.77 39419....
9     AMSZ01    AMSZ10          183 LINESTRING (29501.77 39419....
10    AMSZ01    AMSZ11          930 LINESTRING (29501.77 39419....

15.6.3 Visualising the desire lines

To visualise the resulting desire lines, the code chunk below is used.

Arguments of tm_lines():

lwd: line width. Either a numeric value or a data variable. In the latter case, the class of the highest values (see style) will get the line width defined by scale. If multiple values are specified, small multiples are drawn (see details).

style: method to process the color scale when col is a numeric variable. Discrete gradient options are "cat", "fixed", "sd", "equal", "pretty", "quantile", "kmeans", "hclust", "bclust", "fisher", "jenks", "dpih", "headtails", and "log10_pretty". A numeric variable is processed as a categorical variable when using "cat", i.e. each unique value will correspond to a distinct category

scale: line width multiplier number.

n: preferred number of color scale classes. Only applicable when lwd is the name of a numeric variable.

tm_shape(mpsz) +
  tm_polygons() +
  flowLine %>% 
  tm_shape() +
  tm_lines(lwd = 'MORNING_PEAK',
           style = 'quantile',
           scale= c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha= 0.3)

Warning

Rendering process takes about 1 min because of the transparency argument alpha.

When the flow data are very messy and highly skewed like the one shown above, it is wiser to focus on selected flows, for example flow greater than or equal to 5000 as shown below.

tmap_mode('view')
tmap_options(check.and.fix = TRUE)

tm_shape(mpsz) +
  tm_polygons() +
flowLine %>%  
  filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
  tm_lines(lwd = "MORNING_PEAK",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.3)
ttm()

tm_shape(mpsz) +
  tm_polygons() +
flowLine %>%  
  filter(MORNING_PEAK >= 5000) %>%
tm_shape() +
  tm_lines(lwd = "MORNING_PEAK",
           style = "quantile",
           scale = c(0.1, 1, 3, 5, 7, 10),
           n = 6,
           alpha = 0.3)

#| eval: false
#| echo: false
#| fig-width: 14
#| fig-asp: 0.68
#| code-fold: True

Summaries

OD matrix is often incomplete. Imagine trying to complete the OD matrix, it would involve us doing spatial interaction or OD surveys to find the missing values. There are 332 subzones in Singapore, and each survey is expensive,. In additon, OD matrix is constantly changing as flow patterns changes. We are trying to predict flows between origins and destinations. Flow could be thought of a function of (1) attribute of origin (propulsiveness) (2) attribute of destination (attractiveness) and (3) cost friction (like distance or transport cost or public transport stops). Assumption is that the benefits must outweigh the cost in order for flow to happen.

Gravity model takes into consideration the interaction between all origin and destination locations.

Potential model takes in consideration the interaction between a location and all other location pairs. (Good for measuring accessibility)

Retail model is commonly used by franchise like KFC / Pizza Hut to determine their area/region of service (aka delivery zones) for each outlet.

There are 4 variations in the Gravity model:

  1. Uncontrained: only the overall outflow is fixed and total outflow from origins = total inflow to destinations
  2. Origin constrained: outflow by origin is fixed.
  3. Destination constrained: inflow by destination is fixed.
  4. Doubly constrained: outflow by origin and inflow by destination is fixed.

To calculate flow from each origin to each destination, we need parameters like k, alpha, lambda and beta. The beta for distance is usually negative because we assume that there is an inverse relationship between interaction and distance, like Newtonian physics and laws of gravity.